A heuristic for morpheme discovery based on string edit distance
Abstract
This paper derives from work we have been doing on unsupervised learning of the morphology of languages with rich morphologies, that is, with a high average number of morphemes per word. Our focus in this paper is Swahili, a major Bantu language of East Africa, and our goal is the development of a system that can automatically produce a morphological analyzer of a text on the basis of a large corpus. While a certain amount of work in computational linguistics has already been done on Swahili, our specific goal is a system that can quickly and accurately perform a morphological analysis of any of the approximately 500 Bantu languages when presented with data from it, and little or no computational work currently exists for 99% of them. Our work reported here extends Linguistica, an open source system available at http://linguistica.uchicago.edu.
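For readers unfamiliar with the string edit distance named in the title, the following is a minimal sketch of the standard Levenshtein distance in Python. It is not the paper's heuristic, which the abstract does not describe; the function name `edit_distance` and the Swahili verb forms used in the usage note are illustrative assumptions, not taken from the paper.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions needed to turn a into b."""
    # prev[j] is the distance between the prefix of a handled so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb, or match
            ))
        prev = curr
    return prev[-1]
```

For example, `edit_distance("ninasoma", "unasoma")` returns 2 and `edit_distance("anasoma", "unasoma")` returns 1: the forms differ only in their prefixes, which is the kind of regularity an edit-distance-based approach to morpheme discovery could plausibly exploit.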
Similar resources
The SED heuristic for morpheme discovery: a look at Swahili
This paper describes a heuristic for morpheme- and morphology-learning based on string edit distance. Experiments with a 7,000-word corpus of Swahili, a language with a rich morphology, support the effectiveness of this approach.
Refining The SED Heuristic For Morpheme Discovery: Another Look At Swahili
This paper describes a heuristic for morpheme- and morphology-learning based on string edit distance. Experiments with a 7,000-word corpus of Swahili, a language with a rich morphology, support the effectiveness of this approach.
A Comparison of String Distance Metrics for Name-Matching Tasks
Using an open-source Java toolkit of name-matching methods, we experimentally compare string distance metrics on the task of matching entity names. We investigate a number of different metrics proposed by different communities, including edit-distance metrics, fast heuristic string comparators, token-based distance metrics, and hybrid methods. Overall, the best-performing method is a hybrid s...
A Comparison of String Metrics for Matching Names and Records
We describe an open-source Java toolkit of methods for matching names and records. We summarize results obtained from using various string distance metrics on the task of matching entity names. These metrics include distance functions proposed by several different communities, such as edit-distance metrics, fast heuristic string comparators, token-based distance metrics, and hybrid methods. We ...
Efficiently Supporting Edit Distance Based String Similarity Search Using B+-Trees
Edit distance is widely used for measuring the similarity between two strings. As a primitive operation, edit distance based string similarity search is to find strings in a collection that are similar to a given query string using edit distance. Existing approaches for answering such string similarity queries follow the filter-and-verify framework by using various indexes. Typically, most appr...